Evaluating Speech-to-Text Systems with PennSound

Wright, Jonathan, Liberman, Mark, Ryant, Neville, Fiumara, James

arXiv.org Artificial Intelligence

A random sample of nearly 10 hours of speech from PennSound, the world's largest online collection of poetry readings and discussions, was used as a benchmark to evaluate several commercial and open-source speech-to-text systems. PennSound's wide variation in recording conditions and speech styles makes it representative of many other untranscribed audio collections. Reference transcripts were created by trained annotators, and system transcripts were produced from AWS, Azure, Google, IBM, NeMo, Rev.ai, Whisper, and Whisper.cpp. By word error rate (WER), Rev.ai was the top performer overall, and Whisper was the top open-source performer (provided hallucinations were avoided). AWS had the best diarization error rate (DER) among the three systems evaluated for diarization. However, the WER and DER differences were slim, and various tradeoffs may motivate choosing different systems for different end users. We also examine the issue of hallucinations in Whisper: its users should be aware of the relevant runtime options and whether the speed-versus-accuracy tradeoff is acceptable.
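Word error rate, the headline metric in this evaluation, is the word-level edit distance (substitutions + insertions + deletions) between a reference and a hypothesis transcript, divided by the number of reference words. A minimal sketch of the computation (the function name `wer` is our own, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # delete a reference word
                      d[j - 1] + 1,      # insert a hypothesis word
                      prev + (r != h))   # substitute (free if a match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six reference words
```

Production scoring toolkits additionally normalize case, punctuation, and contractions before alignment, which can shift reported WER noticeably.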


The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems

Liesenfeld, Andreas, Lopez, Alianda, Dingemanse, Mark

arXiv.org Artificial Intelligence

Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial ASR systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This especially impacts the recognition of conversational words (study 2), and in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify the phenomena that need the most attention on the way to building robust interactive speech technologies.


How to Build a Speech-to-Text System using ChatGPT and Python - Pyresearch - Medium

#artificialintelligence

Check out our latest tutorial on how to build a speech-to-text system using ChatGPT and Python! Learn how to leverage the power of natural language processing and deep learning to convert audio to text with impressive accuracy.


Voice-based applications for E-Health

#artificialintelligence

Healthcare has been one of the countless beneficiaries of the revolutionary advances that widespread computing has brought. Fast, efficient data organization, storage, and access have greatly sped up the medical enterprise, yet much low-hanging fruit remains. Chief among it is the increased application of technologies that can process speech. In this post, we'll share with you how speech technology can improve healthcare in the three following ways. Finally, (3) voice signal analysis can be used for earlier diagnosis and to help track changes in medical conditions over time.


Usage of speaker embeddings for more inclusive speech-to-text

AIHub

English is one of the most widely used languages worldwide, with approximately 1.2 billion speakers. To maximise the performance of speech-to-text systems, it is vital to build them in a way that recognises different accents. Recently, spoken dialogue systems have been incorporated into various devices such as smartphones, call services, and navigation systems. These intelligent agents can assist users in performing daily tasks such as booking tickets, setting up calendar items, or finding restaurants via spoken interaction. They have the potential to be more widely used in a vast range of applications in the future, especially in the education, government, healthcare, and entertainment sectors.


AI learns how to fool speech-to-text. That's bad news for voice assistants

#artificialintelligence

A pair of computer scientists at the University of California, Berkeley developed an AI-based attack that targets speech-to-text systems. With their method, no matter what an audio file sounds like, the text output will be whatever the attacker wants it to be. This one is pretty cool, but it's also another entry for the "terrifying uses of AI" category. The team, Nicholas Carlini and Professor David Wagner, were able to trick Mozilla's popular DeepSpeech open-source speech-to-text system by, essentially, turning it on itself: "Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second) … Our attack works with 100% success, regardless of the desired transcription, or initial source phrase being spoken."
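The optimization idea behind such attacks is gradient descent on the input itself: minimize a loss that combines the target-output objective with a penalty on the size of the perturbation. The toy sketch below illustrates this against a stand-in linear model, not DeepSpeech; the weights, step size, penalty weight, and threshold are all made up for illustration (the real attack differentiates the model's CTC loss with respect to the waveform):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)   # stand-in model weights (illustrative)
x = rng.normal(size=16)   # the original "audio" signal

def score(v):
    # Stand-in differentiable model output; the attack's goal here is
    # simply to drive this below an arbitrary target threshold.
    return float(w @ v)

delta = np.zeros_like(x)  # adversarial perturbation, starts at zero
for _ in range(200):
    # Gradient of the loss: score(x + delta) + 0.1 * ||delta||^2
    # (push the output toward the target while keeping delta small)
    grad = w + 0.2 * delta
    delta -= 0.05 * grad
    if score(x + delta) < -1.0:   # illustrative "target reached" test
        break

print(round(score(x), 3), round(score(x + delta), 3))
```

The distortion penalty is what makes the perturbed waveform "over 99.9% similar" to the original: the optimizer stops growing the perturbation as soon as the target output is reached.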